setup

This is the introduction paragraph. In this file we’re cleaning the data and setting up files for later processing.

Load, clean, recode and save

Load files

# Load the neighborhoods GeoJSON
# creates data["Parcels"] containing geojson data.
data = {}
for feature in [selector] + features:
    geofile = get_newest_file( precious_folder, feature, ext=".geojson" )
    #logger.debug(geofile)
    logger.info(f"Found {feature}:  {geofile}" )
    data[feature] = gpd.read_file( geofile )

# convert all feature files to same CRS mapping as Civic_Associations
for feature in features:
    data[feature] = data[feature].to_crs( data[selector].crs )
INFO     | __main__:<module>:7 - Found Civic_Associations:  /home/john/projects/ssg-thefan-analysis/precious/Civic_Associations-2025-05-16.geojson
INFO     | __main__:<module>:7 - Found Addresses:  /home/john/projects/ssg-thefan-analysis/precious/Addresses-2025-09-25.geojson
INFO     | __main__:<module>:7 - Found Parcels:  /home/john/projects/ssg-thefan-analysis/precious/Parcels-2025-09-25.geojson

Drop columns

# columns to drop:
shared_drops = ['CreatedBy','CreatedDate','EditBy','EditDate']
## file specific column drop mappings
drop_columns = {
    "Civic_Associations" : ['OBJECTID'] + shared_drops,
    "Addresses"          : ['OBJECTID'] + shared_drops,
    "Parcels"            : ['OBJECTID'],
}
for feature in [selector] + features:
    data[feature] = data[feature].drop(columns=drop_columns[feature])

Spatial join, select only parcels and addresses in the Fan.

# Pull out only FDA from Civic_Associations and store it
data[selector_key] = data[selector][ data[selector]["Name"] == selector_key ]

# Select only features from the selector_key (FDA).  perform spatial join.
for feature in features:
    predicate = "overlaps" if feature=="Neighborhoods" else "within"
    data[feature+"_in_fan"] = gpd.sjoin(data[feature], data[selector_key], predicate=predicate, how="inner")

Drop unnecessary columns

# Drop columns created from spatial join
shared_drops = ["index_right","AdoptionDate","ChangeDate","Shape__Area","Shape__Length"]
drop_columns = {
    "Addresses"          : ['GlobalID'] + shared_drops,
    "Parcels"            : ['MaskedOwner','GlobalID_left','GlobalID_right'] + shared_drops,
}
for feature in features:
    feature_name = feature+"_in_fan"
    data[feature_name] = data[feature_name].drop(columns=drop_columns[feature])

Create columns and recode columns in Parcels

gdf = data["Parcels_in_fan"]

gdf["OwnerOccupied"] = gdf.apply(
    lambda row: str(row["MailAddress"]).startswith(str(row["AsrLocationBldgNo"]))
                and str(row["MailCity"]).upper() == "RICHMOND"
                and str(row["MailState"]).upper() == "VA"
                and str(row["MailZip"]) == "23220",
    axis=1
).map({True: 1, False: 0})


# Rule 1: If PropertyClass contains 'Commercial'
gdf.loc[gdf["PropertyClass"].str.contains("Commercial", case=False, na=False), "LandUse"] = "Commercial"

# Rule 2: If PropertyClass contains 'Condo'
gdf.loc[gdf["PropertyClass"].str.contains("Condo", case=False, na=False), "LandUse"] = "Multi-Family"

gdf["SharedGeometry"] = gdf.duplicated(subset="geometry", keep=False).astype(int)

# Create FanUse

mapping = {
    "Single Family": "FanResidential",
    "Multi-Family": "FanResidential",
    "Duplex (2 Family)" : "FanResidential",
    "Commercial": "FanBusiness",
    "Industrial": "FanBusiness",
    "Office" : "FanBusiness",
    "Institutional" : "FanBusiness",
    "Mixed-Use" : "FanMixed-Use"
}
gdf["FanUse"] = gdf["LandUse"].map(mapping).fillna("FanOther")

# Ensure PropertyClass is string and safe for NaNs
mask = gdf["PropertyClass"].fillna("").str.contains("vacant|parking|common|garage|storage|tower|space", case=False, na=False)
# Apply recode
gdf.loc[mask, "FanUse"] = "FanOther"


# A new variable to permit easy selecting

gdf["FanUseType"] = "FanIgnore"  # default
gdf.loc[(gdf["OwnerOccupied"] == 1) & (gdf["Mailable"] == 1), "FanUseType"] = "FanOwner"
gdf.loc[(gdf["OwnerOccupied"] == 0) & (gdf["Mailable"] == 1), "FanUseType"] = "FanRental"

# Reset if it's one of the FanOther property classes.
gdf.loc[mask, "FanUseType"] = "FanIgnore"


mapping = {
    "FanResidential": 1,
    "FanBusiness": 10,
    "FanMixed-Use" : 20,
    "FanOther": 99
}
gdf["FanUseOrder"] = gdf["FanUse"].map(mapping).fillna(99)

data["Parcels_in_fan"] = gdf

Clean columns in Addresses

Save Parcels_in_fan and Addresses_in_fan for later use.

for feature in features:
    feature_name = feature+"_in_fan"

    # store to parquet using pyarror (workflow tip from chatgpt
    data[feature_name].to_parquet(f"{feature_name}.parquet",engine="pyarrow")
    logger.info(f"Saving: {feature_name}.parquet" )

    # create dataframe without spatial geometries and store to CSV
    gdf = data[feature_name].drop(columns="geometry")
    gdf.to_csv(f"{feature_name}.csv", index=False)
    logger.info(f"Saving: {feature_name}.csv" )
INFO     | __main__:<module>:6 - Saving: Addresses_in_fan.parquet
INFO     | __main__:<module>:11 - Saving: Addresses_in_fan.csv
INFO     | __main__:<module>:6 - Saving: Parcels_in_fan.parquet
INFO     | __main__:<module>:11 - Saving: Parcels_in_fan.csv

Examine Addresses

In the following sections we review the Addresses file. If changes are necessary, iterate with the cleaning sections and rerun the report until everyting is clean.

List count addresses by zipcode

Loading ITables v2.5.2 from the internet... (need help?)

List odd addresses

This should return blank. These were cleaned/fixed above.

show_result_set("""
select 
  *
from
  addresses
where
  not ZipCode in ('23220','23284')
order by
  StreetName,AddressLabel

""",pageLength=10)
Loading ITables v2.5.2 from the internet... (need help?)

List of Street Names

Loading ITables v2.5.2 from the internet... (need help?)

List AddressLabels with mismatched AddressBase

Loading ITables v2.5.2 from the internet... (need help?)

List Addresses with mismatched AddressExtension

Loading ITables v2.5.2 from the internet... (need help?)

List addresses that don’t add up

Loading ITables v2.5.2 from the internet... (need help?)

List of valid unit types

Loading ITables v2.5.2 from the internet... (need help?)

List of Addresses with unittype in AddressLabel and missing ExtensionWithUnit

Loading ITables v2.5.2 from the internet... (need help?)

View All Addresses

Loading ITables v2.5.2 from the internet... (need help?)

Examine Parcels

FanUse by FanUseType

Loading ITables v2.5.2 from the internet... (need help?)
  • FanOwner - Parcel owner address matches building address number, parcel owner zip is 23220, and parcel has mailable USPS address for owner of record. These parcel owners have their tax record mailed to this parcel address, so they’re probably the owner.
  • FanRental - Parcel owner address doesn’t match building address number. So, parcel owner address is OUTSIDE the Fan. It’s possible that the owner uses a different address for tax bill. NOTE - the parcel database doesn’t contain addresses for these parcels. We don’t have Fan addresses for these parcels, just the address of the tax owner. For example, the address for Joe’s Inn isn’t in parcel database. The parcel owner is outside the Fan.
  • FanIgnore - not a mailable address, probably a park, parking lot, common area, etc.

FanUse, LandUse by FanUseType

Loading ITables v2.5.2 from the internet... (need help?)

FanUse, LandUse, PropertyClass by FanUseType

Loading ITables v2.5.2 from the internet... (need help?)

View All Parcels

Use the Sort field to select subsets of parcels.

Loading ITables v2.5.2 from the internet... (need help?)

Create merge file

Loading ITables v2.5.2 from the internet... (need help?)
Loading ITables v2.5.2 from the internet... (need help?)
Loading ITables v2.5.2 from the internet... (need help?)
Loading ITables v2.5.2 from the internet... (need help?)
Loading ITables v2.5.2 from the internet... (need help?)
Loading ITables v2.5.2 from the internet... (need help?)

Finally, create interim table

Data Dictionary

Perfect 👍 — here’s a full schema documentation for parcel_address_join_with_flags, structured for readability and ready to drop into a README, Quarto doc, or data dictionary.


📑 Table Schema: parcel_address_join_with_flags

Description This table links Parcels (tax assessment records) with Addresses (city-recognized mailing addresses), enriched with classification fields and quality-control flags.

  • One row for every Parcel ↔︎ Address combination.
  • Contains all relevant attributes from both source datasets plus derived fields for analysis.
  • Stored in Parquet (ParcelAddressJoin_in_fan.parquet) for reuse.

1. Parcel Attributes

(From the Parcels dataset)

Column Type Description
ParcelID INTEGER Unique parcel identifier in the city GIS.
PIN VARCHAR Parcel Identification Number (tax lot ID).
CountOfPIN INTEGER Number of PINs associated with this parcel (multi-PIN parcels).
OwnerName VARCHAR Name of the current owner of the parcel.
MailAddress VARCHAR Mailing street address of the owner (may differ from property location).
MailCity VARCHAR City portion of the owner’s mailing address.
MailState VARCHAR State portion of the owner’s mailing address.
MailZip VARCHAR ZIP code of the owner’s mailing address.
AssessmentDate VARCHAR Date of most recent assessment.
LandValue DOUBLE Assessed land value.
DwellingValue DOUBLE Assessed building value.
TotalValue DOUBLE Total assessed value.
LandSqFt DOUBLE Parcel land area in square feet.
ProvalAsmtNhood VARCHAR Assessment neighborhood code.
TaxExemptCode VARCHAR Tax exemption code (if applicable).
PropertyClassID VARCHAR Internal property classification ID.
PropertyClass VARCHAR Human-readable property classification (e.g., R Two Story, B Commercial Common Area Main).
LandUse VARCHAR Land use category (e.g., Single Family, Multi-Family, Commercial, Vacant).
OwnerOccupied BIGINT Flag (1/0) if the property is owner-occupied.
FanUse VARCHAR Local neighborhood-specific use classification.
FanUseType VARCHAR Derived Fan District use type (e.g., FanOwner, FanRenter).
FanUseOrder BIGINT Ordering value for FanUse categories.
parcel_geometry GEOMETRY Polygon geometry of the parcel boundary.

2. Address Attributes

(From the Addresses dataset)

Column Type Description
AddressId VARCHAR Unique address record identifier.
AddressLabel VARCHAR Formatted mailing address (human-readable, e.g. 401 N Robinson St Apt A).
BuildingNumber VARCHAR Street/building number.
StreetDirection VARCHAR Street direction (e.g., N, S, E, W).
StreetName VARCHAR Street name.
StreetType VARCHAR Street type (e.g., St, Ave, Blvd).
ExtensionWithUnit VARCHAR Address extension (e.g., Rear, Suite, Apt).
UnitType VARCHAR Unit type (e.g., Apt, Unit, Suite).
UnitValue VARCHAR Unit number or identifier.
ZipCode VARCHAR Postal ZIP code.
Mailable VARCHAR “Yes” if USPS deliverable, “No” otherwise.
StatePlaneX DOUBLE X coordinate (State Plane projection).
StatePlaneY DOUBLE Y coordinate (State Plane projection).
Latitude DOUBLE Latitude coordinate (WGS84).
Longitude DOUBLE Longitude coordinate (WGS84).
address_geometry GEOMETRY Point geometry of the address location.

3. Derived Fields (Flags and Counts)

Column Type Description
is_mailable INTEGER 1 if Mailable = 'Yes', else 0.
address_count INTEGER Number of addresses linked to the parcel.
mailable_count INTEGER Number of mailable addresses linked to the parcel.
match_category VARCHAR Classification of parcel–address relationship. Possible values:
- No match
- Single address (mailable)
- Single address (not mailable)
- Multiple addresses (1 mailable)
- Multiple addresses (many mailable)
- Multiple addresses (none mailable)
- Unclassified.
expectation_flag VARCHAR Quality flag comparing match_category against LandUse expectations. Possible values:
- Expected
- Unexpected
- Review.
label_parcel_count INTEGER Number of distinct parcels sharing the same AddressLabel.

4. Row Semantics

  • Each row = one parcel–address combination.
  • Parcels with multiple addresses → appear multiple times.
  • Addresses tied to multiple parcels (rare, e.g., condos, common-area splits) → appear multiple times.
  • Together, this table enables analysis at the parcel-level, address-level, or household-level.

✅ With this schema, you can:

  • Join members (by AddressLabel or AddressId) and check anomalies (expectation_flag).
  • Collapse to households (using AddressLabel, label_parcel_count).
  • Do penetration analysis across LandUse / PropertyClass.

👉 Do you want me to also sketch a schema diagram (like an ERD-style box with key columns and relationships between parcels, addresses, and the join) so you have a visual to drop into documentation?

How many rows share the same AddressLabel?

Loading ITables v2.5.2 from the internet... (need help?)